Fine-tuning an LLM on your texts: part 2 – exploring your text data

This is the next installment in my guide to training an LLM on your text message history, as I did over the holidays across my 240,000 messages. In this post, you'll find out more than you ever wanted to know about your own text messages… The previous post is here.

In this part and the next, we will cleanse and curate a dataset of documents that are well suited to Natural Language Generation. This may not sound like the most gripping part of the journey – but actually I had a ton of fun with this stage. In addition, perfecting the data format turned out to have a far greater impact on the results than endlessly tweaking hyper-parameters.

I decided to carry out all my work formatting my text messages locally on my laptop. Then I encrypted the data before uploading to Hugging Face. This shouldn’t be strictly necessary, as the data is kept private by Hugging Face, but I wanted to take this extra precaution to keep my texts safe, and you might want to too.
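
As a taste of the encryption step (the full pipeline comes in the next part), the cryptography package's Fernet class makes symmetric encryption straightforward. Here's a minimal sketch; the key handling is up to you, but keep the key out of anything you upload:

from cryptography.fernet import Fernet

key = Fernet.generate_key()  # store this somewhere safe, e.g. a local file or password manager
fernet = Fernet(key)

# Encrypt a message (bytes in, encrypted bytes out) and confirm it round-trips
token = fernet.encrypt('a private text message'.encode())
print(fernet.decrypt(token).decode())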

Getting your hands on the data

Bring up a Jupyter Notebook or Jupyter Lab on your local machine (for example, by creating a virtual env and doing pip install jupyterlab then jupyter lab). Then get your environment ready for show time:

# constants
MAX_LENGTH = 200  # The length of each chunk
DATA_NAME = "your-hf-username/messagesv1"  # for uploading
ME = "Edward" # your name here!
base_model_name = "meta-llama/Llama-2-7b-chat-hf"  # for the tokenizer

# installs
!pip install ipywidgets datasets cryptography torch transformers sentencepiece matplotlib wordcloud

# imports
import csv
import datetime
import random
from collections import Counter
import datasets
from cryptography.fernet import Fernet  # to encrypt our texts
import tqdm
import torch
from transformers import AutoTokenizer
import matplotlib
import matplotlib.pyplot as plt
from wordcloud import WordCloud
%matplotlib inline

Put the CSV exports from iMazing into the root directory of your notebook, and run this, updating with your filenames:

txts = []
filenames = ['all_text_messages.csv', 'all_whatsapp_messages.csv']
for filename in filenames:
    with open(filename, newline='') as csvfile:
        reader = csv.DictReader(csvfile, delimiter=',', quotechar='"')
        txts.extend(reader)
        
print(f'Read in a total of {len(txts):,} messages')

# For me, this outputs: Read in a total of 266,337 messages

It’s worth taking some time to dig into the data to spot problems — here’s one I found and fixed right away:

INTRO = 'Messages to this chat and calls are now secured with end-to-end encryption'
txts = [txt for txt in txts if txt['Text'] != INTRO]
print(f'Now a total of {len(txts):,} messages') 

The messages are currently dicts, so at this point I convert them to objects for convenience. I filter out messages from groups and from people not in my contacts. I also cleanse the text a bit, such as swapping colons for semi-colons, because colons will later have a special meaning in the training data.

class Message:

    def __init__(self, chat_session, message_type, text, when):
        self.name = chat_session
        self.sender = self.name if message_type == 'Incoming' else ME
        self.receiver = ME if message_type == 'Incoming' else self.name
        self.text = text
        self.when = datetime.datetime.strptime(when, '%Y-%m-%d %H:%M:%S')
        self.massage_text()

    def massage_text(self):
        # Replace special characters used in our format for training
        self.text = self.text.replace('\n','  ').replace(':',';').replace('#',';')
        # Indicate if the message is an image
        if self.text == '': self.text = '***'
    
    def should_exclude(self):
        return any(ch in self.name for ch in '+&,') or all(ch.isdigit() for ch in self.name)

# Create lists of messages

messages = [Message(t['Chat Session'], t['Type'], t['Text'], t['Message Date']) for t in txts]
messages = [m for m in messages if not m.should_exclude()]
print(f'A total of {len(messages):,} messages')

# For me, this outputs: A total of 242,230 messages
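
If you're curious which names the filter catches, here's a quick sanity check with made-up examples (group chats contain ',' or '&', and unknown numbers are all digits or start with '+'):

# Sanity-check the exclusion rules with made-up names
for name in ['Alice', 'Alice, Bob & Carol', '+447700900123', '07700900123']:
    m = Message(name, 'Incoming', 'hello', '2023-01-01 12:00:00')
    print(name, '->', 'excluded' if m.should_exclude() else 'kept')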

Organize the data

We now organize the messages into a dictionary, keyed by the name of the person we’re chatting with. The values are lists of Message objects, sorted with the earliest message first. This interleaves the regular text messages and WhatsApp messages.

# Organize into dict with key = chat name, value = list of messages

chats = {}
for message in messages:
    if message.name not in chats:
        chats[message.name] = []
    chats[message.name].append(message)

# Sort the chats by time

for message_list in chats.values():
    message_list.sort(key = lambda m: m.when)

print(f'{sum([len(v) for v in chats.values()]):,} messages with {len(chats)} people')

# Gives me 242,230 messages with 472 people

I decided that I needed to have exchanged at least 20 text messages with someone for them to be included. This had a marginal effect on the message count, but a large effect on the number of unique people. Definitely something to experiment with.

AT_LEAST = 20
chats = {name: messages for name, messages in chats.items() if len(messages)>=AT_LEAST}
print(f'{sum([len(v) for v in chats.values()]):,} messages with {len(chats)} people')

# Gives me 240,985 messages with 290 people

Investigate the data

It’s always essential to examine your data to look for trends and anomalies. In this case, it’s also super interesting! I started by looking at how often I’ve texted over the years — guess when I met my partner…

# Prepare data
dates = [message.when for message in messages]

# Plot
fig, ax = plt.subplots(1, 1)
plt.title("How many texts I've sent over time")
ax.set_xlabel('Year')
ax.set_ylabel('How many texts');
ax.get_yaxis().set_major_formatter(matplotlib.ticker.FuncFormatter(lambda y, p: format(int(y), ',')))
_ = ax.hist(dates, bins=20, color='purple', rwidth=0.5)

And if you’re curious to see who you’ve texted the most often:

# Prepare data
counter = Counter(message.name for message in messages)
results = counter.most_common(40)
names, counts = zip(*results)

# Plot
fig, ax = plt.subplots(1, 1, figsize = (10, 5))
ax.set_ylabel('How many texts');
ax.get_yaxis().set_major_formatter(matplotlib.ticker.FuncFormatter(lambda y, p: format(int(y), ',')))
plt.xticks(range(len(names)), names, rotation='vertical')
_ = ax.bar(names, counts, color ='teal', width = 0.5)

These are my results, with the names masked!

I’ve saved the best for last! Now we create a word cloud of the messages to get some insights. For my version here, I’ve limited it to the top 50 words to avoid revealing some of my unfortunate nicknames… but it’s great fun to investigate beyond 50 words! Also try filtering the messages to just the ones you sent, or to chats with specific people, to see how the tone of your conversations changes.

# Prepare data
text = ' '.join([message.text for message in messages])

# Alternatively - only my sent messages, or only chats with 1 person
# text = ' '.join([message.text for message in messages if message.sender==ME])
# text = ' '.join([message.text for message in chats['recipient name here']])

# Plot
wordcloud = WordCloud(max_font_size=60, max_words=50).generate(text)
plt.figure(figsize = (10, 10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

Prep for next time: tokens

In the next post, we’ll start to work with tokens instead of text. If you’re not familiar with tokenization, now would be an excellent time to check out the fantastic video from Jon Krohn that I posted last week. In preparation for next time, we should investigate the Tokenizer used by the Llama 2 model.

tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

We can use the methods encode, decode and convert_ids_to_tokens to investigate how text maps to tokens, and some of the quirks of tokenization.

tokens = tokenizer.encode('### Edward: hello me ### Edward: hi me')
print(tokens)
print(tokenizer.decode(tokens))
print(tokenizer.convert_ids_to_tokens(tokens))

From experimenting with the inputs, you’ll notice a few things:

  • The tokenizer adds a beginning of sentence token <s> with value 1
  • If you change the input to put each message on a new line (as in, '### Edward: hello me\n### Edward: hi me') then the second ### is represented differently from the first (broken into 2 tokens); the sketch after this list demonstrates it. I tried to structure the data to avoid this kind of inconsistency.
  • The prompt tags <<SYS>> <</SYS>> and [INST] [/INST] don’t have special tokens; they get tokenized like any other text. I was surprised by this, as were others on the internet… apparently the use of these tags is just a convention that’s been used during training.
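
To see these quirks for yourself, here's a small sketch (reusing the tokenizer loaded above) that compares the two '###' variants and confirms the prompt tags are just ordinary text to the tokenizer:

# Compare how '###' is tokenized with and without a preceding newline
print(tokenizer.convert_ids_to_tokens(tokenizer.encode('### Edward: hello me ### Edward: hi me')))
print(tokenizer.convert_ids_to_tokens(tokenizer.encode('### Edward: hello me\n### Edward: hi me')))

# The prompt tags are split into ordinary sub-word tokens rather than single special tokens
for tag in ['[INST]', '[/INST]', '<<SYS>>', '<</SYS>>']:
    print(tag, tokenizer.convert_ids_to_tokens(tokenizer.encode(tag, add_special_tokens=False)))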

Next time, we’ll pack our text messages into datasets, encrypt them and upload to Hugging Face, ready for us to fine-tune our model using QLoRA. The next installment is here.

2 responses to “Fine-tuning an LLM on your texts: part 2 – exploring your text data”

  1. Thanks for sharing this amazing article. Do you have sample data available so that I can try it on my own?

    1. Hey Ankush. Thanks for your feedback. Unfortunately my data is quite sensitive as it contains all my personal text messages! I’m hoping you can follow the steps in this post and the previous one to create your own dataset.
